fix/review-issue-2 #4

MQ37 · 2025-01-13T20:05:34Z

PR for #2

Changes

Code Changes

Renamed the input parameter from url to startUrl for consistency with Website Content Crawler.
Removed the default value in actor_input = await Actor.get_input() or {'url': 'https://docs.apify.com/'} to ensure the Actor fails if no value is provided.
Added logs to improve the UX when Website Content Crawler is running.
Used the walrus operator to merge the assignment record = await kvstore.get_record(store_id) and the following if record is None check into a single expression, which improves readability.
Fixed typo in the comment "# changed by get_crawler_actor_config with defailt value 1".
Updated CRAWLER_CONFIG to only set values that differ from the default.
Added an example to the docstring for the render(data: dict) -> str function to help users understand its usage.
Added logging for store = await Actor.open_key_value_store() and await store.set_value('llms.txt', output) to enhance UX.
Removed the TODO comment: # TODO: use path or LLM suggestions to group pages into sections # noqa: TD003 and recommended creating an issue for future improvements.
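
The walrus operator change listed above can be sketched as follows. This is a minimal illustration, not the code from this PR: `FakeKeyValueStore` and `load_record` are hypothetical stand-ins, and only the names `kvstore`, `get_record`, and `store_id` come from the description.

```python
import asyncio


class FakeKeyValueStore:
    """Hypothetical in-memory stand-in for a key-value store client."""

    def __init__(self, records: dict) -> None:
        self._records = records

    async def get_record(self, key: str):
        # Return the stored record, or None when the key is missing.
        return self._records.get(key)


async def load_record(kvstore: FakeKeyValueStore, store_id: str) -> dict:
    # The walrus operator assigns and checks for None in one expression.
    if (record := await kvstore.get_record(store_id)) is None:
        raise ValueError(f'Record {store_id!r} not found')
    return record


kvstore = FakeKeyValueStore({'abc': {'value': 'hello'}})
found = asyncio.run(load_record(kvstore, 'abc'))
```

The parentheses around the assignment expression are required, otherwise `is None` would bind to the awaited call rather than to `record`.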

README Changes

Mentioned the use of Website Content Crawler, explained its purpose, and provided a link to the Actor.
Ensured consistent formatting for all list items in the README.
Updated list bullet point A simple, AI-focused structure to help coders, researchers, and AI models easily access and use website content. to include emphasis for consistency.
Replaced all caps in titles like Features of llms.txt Generator or Content Extraction to use proper capitalization.

Open Points

Clarify why get_description_from_html is used: Sometimes the Website Content Crawler does not return meta descriptions in the dataset, even when available in the HTML. So, I extract the descriptions myself for now, but I can try to fix this issue in the crawler.
Add a memory limit for the crawler Actor to ensure it works with the free tier: A 4 GB crawler memory limit is already hardcoded, and this has been mentioned in the README.
Consider replacing the dummy example of llms.txt with a real one: Added a real output example of llms.txt generated by the Actor for docs.apify.com and kept the proposed structure example on top.
Do not use all caps in the README.md: Replaced all titles and non-entity name words with lowercase letters instead of using title case. Is this correct? @jirispilka
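
For the first open point, extracting the meta description directly from the HTML might look roughly like the sketch below. The function name `get_description_from_html` comes from this PR, but its real implementation is not shown here, so the body (based on the standard library `html.parser`) and the helper class are assumptions.

```python
from html.parser import HTMLParser
from typing import Optional


class _MetaDescriptionParser(HTMLParser):
    """Hypothetical helper that remembers the first <meta name="description">."""

    def __init__(self) -> None:
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags and stop after the first match.
        if tag != 'meta' or self.description is not None:
            return
        attributes = dict(attrs)
        # Attribute values may be None for valueless attributes.
        if (attributes.get('name') or '').lower() == 'description':
            self.description = attributes.get('content')


def get_description_from_html(html: str) -> Optional[str]:
    """Return the meta description found in the HTML, or None if absent."""
    parser = _MetaDescriptionParser()
    parser.feed(html)
    return parser.description


page = '<html><head><meta name="description" content="Apify docs"></head></html>'
description = get_description_from_html(page)
```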

jirispilka

I'm sorry, I have a couple more comments, as previously I only did a quick pass.

main.py

In the comments, avoid general references to Apify Actors. Be specific to this actor. For example: "Main entry point for this Apify Actor."
Do not raise RuntimeError; use Apify.fail instead.
Ensure you handle the case when the dataset is empty.

helpers.py

I'm not a fan of one-line functions, as they make the code harder to read. For example:

def render_llms_txt(data: dict) -> str:
    """Renders the `llms.txt` file using the provided data."""
    return render(data)

renderer.py

Rather than concatenating strings directly, add items to an array and then use join to concatenate. It is more efficient.
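
A minimal sketch of the list-and-join approach is shown below. The function name `render` matches the one mentioned in this PR, but the shape of `data` (keys `title`, `description`, `sections`, `links`) is an assumption made for illustration, not the Actor's actual schema.

```python
def render(data: dict) -> str:
    """Render llms.txt-style markdown by collecting lines and joining once."""
    lines = [f"# {data['title']}", '']
    if data.get('description'):
        lines.extend([f"> {data['description']}", ''])
    for section in data.get('sections', []):
        lines.extend([f"## {section['title']}", ''])
        for link in section.get('links', []):
            # The note after the link is optional.
            note = f": {link['description']}" if link.get('description') else ''
            lines.append(f"- [{link['title']}]({link['url']}){note}")
        lines.append('')
    # A single join avoids building many intermediate strings.
    return '\n'.join(lines)


example = {
    'title': 'Docs',
    'description': 'Example site',
    'sections': [
        {
            'title': 'Index',
            'links': [
                {'title': 'Home', 'url': 'https://example.com/', 'description': 'Start here'},
            ],
        },
    ],
}
output = render(example)
```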

/tests

I generally prefer pytest over unittest. pytest has a cleaner syntax, supports fixtures, and, for me, is easier to work with. Also, it is used in python-crawlee, which we (or at least I) consider the Apify standard.
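
To illustrate the syntax difference, here is the same check written in both styles. `render_heading` is a hypothetical function under test, not part of this repository; the pytest-style test needs no import because pytest collects functions by their `test_` prefix.

```python
import unittest


def render_heading(title: str) -> str:
    """Hypothetical function under test."""
    return f'# {title}'


class TestRenderHeading(unittest.TestCase):
    """unittest style: a TestCase subclass and assertion methods."""

    def test_heading(self) -> None:
        self.assertEqual(render_heading('Docs'), '# Docs')


def test_heading() -> None:
    """pytest style: a plain function and a bare assert."""
    assert render_heading('Docs') == '# Docs'
```

pytest can also run the `unittest.TestCase` class unchanged, so a migration can be done file by file.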

janbuchar

I agree with the points made by @jirispilka, and I added some nits of my own. But overall, it looks solid!

janbuchar · 2025-01-15T10:47:07Z

.actor/input_schema.json

      "type": "string",
-      "description": "The URL of website you want to get the llm.txt generated for.",
+      "description": "The URL from which the crawler will start to generate the llms.txt file.",
      "editor": "textfield",


It's better to use the requestListSources editor in this case - see https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1#array. You may use apify.RequestList to process it afterwards.

Unless you really want to support just one URL every time, of course.

Because of how the llms.txt format is specified, I think a single URL input is more suitable: we treat the file as an index for the whole site (or sub-site), so the URL should act as a root (or sub-root).
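
With a single URL and no default value, reading the input could be sketched as below. The helper `get_start_url` is hypothetical, and the input key `startUrl` is assumed from the commit message in this PR; the Actor's real handling may differ.

```python
from typing import Optional
from urllib.parse import urlparse


def get_start_url(actor_input: Optional[dict]) -> str:
    """Return the single root URL from the input, or raise if it is unusable."""
    start_url = (actor_input or {}).get('startUrl')
    if not start_url:
        # No default value: a missing URL is an error.
        raise ValueError('Missing required input "startUrl".')
    parsed = urlparse(start_url)
    if parsed.scheme not in ('http', 'https') or not parsed.netloc:
        raise ValueError(f'Invalid start URL: {start_url!r}')
    return start_url


root = get_start_url({'startUrl': 'https://docs.apify.com/'})
```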

janbuchar · 2025-01-15T10:51:36Z

src/helpers.py

@@ -13,6 +13,8 @@
 if TYPE_CHECKING:
    from apify_client.clients import KeyValueStoreClientAsync

+logger = logging.getLogger('apify')


You can just use Actor.log instead.

I am using Actor.log in main.py now, but in helpers.py it makes pytest print warnings about a nonexistent event loop, so I will keep the logger there.
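
For reference, a module-level logger obtained with `logging.getLogger('apify')` is a plain standard-library logger and can be used from synchronous helper code. In this sketch, `count_words` is a hypothetical helper added only to show the call site.

```python
import logging

# Module-level logger, as in the diff above.
logger = logging.getLogger('apify')


def count_words(text: str) -> int:
    """Hypothetical helper that logs through the module-level logger."""
    words = len(text.split())
    logger.info('Counted %d words', words)
    return words
```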

MQ37 · 2025-01-15T13:30:59Z

Refactored the comments.
I am using the Actor context manager, which calls Apify.fail automatically when an exception is raised.
Added a check for an empty dataset from the WCC call; the Actor will fail in that case.
Removed render_llms_txt from helpers.py.
The renderer now uses string join.
Switched over to pytest.
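
The "fail on raise" behavior and the empty-dataset check described above can be sketched with a generic async context manager. `FakeActor` is a hypothetical stand-in written for this example, not the Apify SDK; in particular, suppressing the exception in `__aexit__` is a simplification so the resulting status can be inspected.

```python
import asyncio


class FakeActor:
    """Hypothetical stand-in that marks the run as failed on an exception."""

    def __init__(self) -> None:
        self.status = 'RUNNING'

    async def __aenter__(self) -> 'FakeActor':
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        # Mark the run as failed when the body raised, otherwise as succeeded.
        self.status = 'FAILED' if exc_type is not None else 'SUCCEEDED'
        return True  # Suppress the exception so the sketch can return the status.


async def run(items: list) -> str:
    actor = FakeActor()
    async with actor:
        if not items:
            # Empty dataset from the crawler call: raise and let the manager fail the run.
            raise RuntimeError('No pages were crawled; the dataset is empty.')
    return actor.status


status_empty = asyncio.run(run([]))
status_ok = asyncio.run(run([{'url': 'https://example.com/'}]))
```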

jirispilka · 2025-01-15T19:51:15Z

README.md


-The **llms.txt Generator Actor** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA.
+The **llms.txt generator** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This tool leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.


Suggested change

-The **llms.txt generator** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This tool leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.
+The **llms.txt generator** is an Apify Actor that helps you extract essential website content and generate an [llms.txt](https://llmstxt.org/) file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This Actor leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.

capitalized 'Actor' and added llms.txt org link to top description

jirispilka

One nit: tool -> Actor

MQ37 · 2025-01-16T10:23:34Z

changed tool -> Actor in README

jirispilka · 2025-01-16T10:39:55Z

Thanks for the changes!

MQ37 added 5 commits January 13, 2025 20:19

renamed input from url to startUrl, removed default input fail instea…

3c97281

…d, mention memory limit in readme, readme mention website content crawler, crawler config only non defaults, added logging and minor code improvements

remove "actor" from the name

13b5f36

fix readme naming title case (all caps)

0b00e65

fix input schema field type, fix logging

d3729ef

fix typo in readme

fdacc57

MQ37 self-assigned this Jan 13, 2025

format code

82ac951

jirispilka requested review from janbuchar and jirispilka January 14, 2025 12:26

jirispilka requested changes Jan 14, 2025

janbuchar reviewed Jan 15, 2025

refactor comments, raise on empty dataset, removed render from helper…

182e388

…s, render using join, switch to pytest, actor.log in main

MQ37 requested review from jirispilka and janbuchar January 15, 2025 13:37

jirispilka reviewed Jan 15, 2025

jirispilka approved these changes Jan 15, 2025

MQ37 added 2 commits January 16, 2025 11:17

fix readme, tool -> Actor, capitalize Actor

27bb281

add top description llms.txt org link

3b5f9b1

MQ37 merged commit c274710 into master Jan 16, 2025
1 check passed

MQ37 deleted the fix/review-issue-2 branch January 16, 2025 10:55
